ABSTRACT


Peer assessment has been widely applied across diverse aca- 
demic fields over the last few decades, and has demonstrated 
its effectiveness. However, the advantages of peer assess- 
ment can only be achieved with high-quality peer reviews. 
Previous studies have found that high-quality review com- 
ments usually comprise several features (e.g., contain sug- 
gestions, mention problems, use a positive tone). Thus, re- 
searchers have attempted to evaluate peer-review comments 
by detecting different features using various machine learn- 
ing and deep learning models. However, there is no single 
study that investigates using a multi-task learning (MTL) 
model to detect multiple features simultaneously. This pa- 
per presents two MTL models for evaluating peer-review 
comments by leveraging the state-of-the-art pre-trained lan- 
guage representation models BERT and DistilBERT. Our 
results demonstrate that BERT-based models significantly 
outperform previous GloVe-based methods by around 6% in 
F1-score on tasks of detecting a single feature, and MTL
further improves performance while reducing model size. 


Keywords 
Peer assessment, peer feedback, automated peer-assessment 
1. INTRODUCTION


Peer assessment is a process by which students give feedback 
on other students’ work based on a rubric provided by the 
instructor [20, 24]. This assessment strategy has been widely 
applied across diverse academic fields, such as computer sci- 
ence [28], medicine [27], and business [1]. Furthermore, mas- 
sive open online courses (MOOCs) commonly use peer as- 
sessment to provide feedback to students and assign grades. 
There is abundant literature [7, 24, 25, 11] demonstrating 
the efficacy of peer assessment. For example, Double et al.
[7] conducted a meta-analysis of 54 controlled experiments 
for evaluating the effect of peer assessment across subjects 
and domains. The results indicate that peer assessment is 
more effective than teacher assessment, and also remarkably
robust across a wide range of contexts [7]. 


However, low-quality peer reviews are a persistent problem 
in peer assessment, and considerably weaken the learning 
effect [17, 22]. The advantages of peer assessment can only 
be achieved with high-quality peer reviews [14]. This sug- 
gests that peer reviews should not be simply transmitted 
to other students but rather should be vetted in some way. 
Course staff could check the quality of each review comment¹
and assess its credibility manually, but this is not efficient. 
Sometimes (e.g., for MOOCs), this is not remotely possi- 
ble. Therefore, to ensure the quality of peer reviews and 
the efficiency of evaluating their quality, the peer-assessment 
platform should be capable of assessing peer reviews auto- 
matically. We call this Automated Peer-Review Evaluation. 


Previous research has determined that high-quality review 
comments usually comprise several features [14, 2, 25]. Examples
of such features are “contains suggestions”, “mentions problems”,
“uses a positive tone”, “is helpful”, and “is localized” [14].
Thus, one feasible and promising way to evaluate peer reviews
automatically is to adjudicate the quality
of each review comment based on whether it comprises the 
predetermined features, by treating this task as a text classi- 
fication problem. If a peer-review comment does not contain 
some of the features, the peer-assessment platform could 
suggest that the reviewer should revise the review comment 
to add missing features. Additionally, containing sugges- 
tions, mentioning problems, and using a positive tone, are 
among the most essential features. Thus, we use them for 
this study. 


Previous work for automatically evaluating review comments 
has focused on tasks that detect a single feature. For exam- 
ple, Xiong and Litman [33] designed sophisticated features 
and used traditional machine-learning methods for identify- 
ing peer-review helpfulness. Zingle et al. [37] utilized differ- 
ent rule-based, machine-learning, and deep-learning meth- 
ods for detecting suggestions in peer-review comments. How- 
ever, to the best of our knowledge, no single study exists 
that investigates using a multi-task learning (MTL) model 
to detect multiple features simultaneously (as illustrated in 
Figure 1), although extensive research has been carried out on
the topic of automated peer-review evaluation (e.g., [34, 32,
33, 31, 30, 37, 13, 19, 8]).


¹In some peer-assessment systems, reviews are “holistic”. In
others, including the systems we are studying, each review
contains a set of review comments, and each comment gives a
response to a different criterion in the rubric.


Figure 1: Illustration of the single-task and multi-task learning settings


There are at least two motivations for using multi-task learn- 
ing (MTL) to detect features simultaneously. Firstly, the 
problem naturally lends itself to MTL, because multiple
features usually need to be employed for a comprehensive
and precise evaluation of peer-review comments. If we treat
this MTL problem as multiple independent single tasks, to- 
tal model size and prediction time will increase by a factor of 
the number of features used for evaluating review comments. 
Secondly, MTL can increase data efficiency. This implies 
that learning tasks jointly can lead to performance improve- 
ment compared with learning them individually, especially 
when training samples are limited [5, 36]. More specifically, 
MTL can be viewed as a form of inductive transfer learn- 
ing, which can help improve the performance of each jointly 
learned task by introducing an inductive bias [3]. 


Additionally, the pre-trained language model, BERT (Bidi- 
rectional Encoder Representations from Transformers) [6], 
has become a standard tool for reaching the state of the art 
in many natural language processing (NLP) tasks. BERT 
can significantly reduce the need for labeled data. There- 
fore, we propose multi-task learning (MTL) models for eval- 
uating review comments by leveraging the state-of-the-art 
pre-trained language representation models BERT and Dis- 
tilBERT. We first compare a BERT-based single-task learn- 
ing (STL) model with the previous GloVe-based STL model. 
We then propose BERT and DistilBERT based MTL models 
for jointly learning different tasks simultaneously. 


The rest of the paper is organized as follows: Section 2 
presents related work. Section 3 describes the dataset used 
for this study. The proposed single-task and multi-task text 
classification models are elaborated in Section 4. Section 5 
details the experimental setting and results. In Section 6, we 
conclude the paper, mention the limitations of our research, 
and discuss future work. 


2. RELATED WORK 


2.1 Automated Peer-Review Evaluation 

The earliest study on automated peer-review evaluation was 
performed by Cho in 2008 [4], who manually broke down
every peer-review comment into review units (self-contained
messages in each review comment) and then coded them
as praise, criticism, problem detection, or solution suggestion.
Cho [4] utilized traditional machine learning methods, in- 
cluding naive Bayes, support vector machines (SVM), and 
decision trees, to classify the review units. 


Xiong et al. attempted to use features (e.g., counts of nouns, 
verbs) derived from regular expressions and dependency parse 
trees and rule-based methods to detect localization in the 
review units [32]. Then, they designed more sophisticated 
features by combining generic linguistic features mined from 
review comments and specialized features, and used SVM to 
identify peer-review helpfulness [33]. After that, Xiong et al.
upgraded their models to the comment level (using the whole
review comment instead of review units as the input) [15, 16].


Then, researchers started to use deep neural networks on 
tasks of automated peer-review evaluation for improving accuracy.
Zingle et al. compared rule-based, machine-learning,
and deep neural-network methods for detecting suggestions
in peer assessments, and the result showed that deep-learning 
methods outperformed other traditional methods [37]. Xiao 
et al. collected around 20,000 peer-review comments and 
leveraged different neural networks to detect problems in 
peer assessments [31]. 


2.2 Multi-Task Learning 


Multi-task learning (MTL) is an important subfield of machine
learning in which multiple tasks are learned simultaneously
[35, 5, 3] to help improve the generalization performance
of all the tasks. A task is defined as {p(x), p(y|x), L},
where p(x) is the input distribution, p(y|x) is the distribution
over the labels given the inputs, and L is the loss function.
For the MTL setting in this paper, all tasks have the
same input distribution p(x) and loss function L, but different
distributions over the labels given the inputs, p(y|x).


In the context of deep learning, all methods of MTL can 
be partitioned into two groups: hard-parameter sharing and 
soft-parameter sharing [3]. For hard-parameter sharing, the 
hidden layers are shared between all tasks while keeping sev- 
eral task-specific output layers. For soft-parameter sharing, 
each task has its independent model, but the distance be- 
tween the different models’ parameters is regularized. For 
this study, we use the hard-parameter sharing approach. 
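

The two groups can be written down as a hedged formalization. The notation below is ours and is not taken from the cited works: f denotes a shared encoder with parameters θ_sh, g_t a task-specific output layer, T the number of tasks, and λ a regularization weight. Summing the per-task losses and using a squared L2 distance are common choices that are assumed here for illustration.

```latex
% Hard-parameter sharing: one shared encoder, T task-specific output layers
\hat{y}_t = g_t\bigl(f(x;\theta_{sh});\theta_t\bigr), \qquad t = 1,\dots,T
\min_{\theta_{sh},\theta_1,\dots,\theta_T} \; \sum_{t=1}^{T} L\bigl(\hat{y}_t, y_t\bigr)

% Soft-parameter sharing: separate parameters per task, tied by a distance penalty
\min_{\theta^{(1)},\dots,\theta^{(T)}} \; \sum_{t=1}^{T} L\bigl(g_t(x;\theta^{(t)}), y_t\bigr)
  + \lambda \sum_{s<t} \bigl\lVert \theta^{(s)} - \theta^{(t)} \bigr\rVert_2^2
```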


526 Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 


Table 1: Sample Rubric Criteria 


Does the design incorporate all of the functionality required? 
Have the authors converted all the cases discussed in the test plan into automated tests? 
Does the design appear to be sound, following appropriate principles and using appropriate patterns? 


Table 2: Sample Data 


Peer-Review Comments (lower-cased) | Sugg. | Prob. | Tone
lots of good background details is given but the testing and implementation sections are missing. | 0 | 1 | 1
the explanation is clear to follow but it could also include some explanation of the use cases. | 1 | 0 | 1
only problem statement is explained and nothing about design. please add design and diagrams. | 1 | 1 | 0


3. DATA 


3.1 Data Source: Expertiza²

The data in this study is collected from an NSF-funded 
peer-assessment platform, Expertiza. In this flexible peer- 
assessment system, students can submit their work and peer- 
review the learning objects (such as articles, code, and web- 
sites) of other students [9]. This platform supports multi- 
round peer review. In the assignments that provided the 
review comments for this study, two rounds of peer review 
(and one round of meta-review) were used: 


1. The formative-feedback phase: For the first round of 
review, students upload substantially complete projects. 
The system then assigns each student to review a set 
number of these submissions, based on a rubric pro- 
vided. Sample rubric criteria are provided in Table 
1. 


2. The summative-feedback phase: After students have 
had an opportunity to revise their work based on feed- 
back from their peers, final deliverables are submit- 
ted and peer-reviewed using a summative rubric. The 
rubric may include criteria such as “How well has the 
team has addressed the feedback given in the first re- 
view round?”. Many criteria in the rubric ask review- 
ers to provide a numeric rating as well as a textual 
comment. 


3. The meta-review phase: After the grading period is 
over, course staff typically assess and grade the reviews 
provided by students. 


For this study, all textual responses to the rubric crite- 
ria from the formative-feedback phase and the summative- 
feedback phase of a graduate-level software-engineering course 
are extracted to constitute the dataset. Each response to a 
rubric criterion constitutes a peer-review comment. All re- 
sponses from one student to a set of criteria in a single rubric 
are called a peer review or a review. In this study, we fo- 
cus on evaluating each peer-review comment. After filtering 
out review comments that only contain symbols and special 
characters, the dataset consists of 12,053 review comments. 
In the future, we will update the platform so that review
comments of this type are rejected by the system directly.
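

The filtering step can be illustrated with a minimal sketch. The paper does not give the exact rule, so the criterion used here (keep a comment only if it contains at least one ASCII letter or digit) and the function names are assumptions made for illustration.

```python
import re

# Assumption: a comment carries content if it has at least one letter or digit.
_HAS_ALNUM = re.compile(r"[A-Za-z0-9]")


def is_informative(comment: str) -> bool:
    """Return True if the comment contains at least one alphanumeric character."""
    return bool(_HAS_ALNUM.search(comment))


def filter_comments(comments):
    """Drop comments made up only of symbols, punctuation, or whitespace."""
    return [c for c in comments if is_informative(c)]


sample = ["good work, but add tests.", "???", "  --  ", ":)", "ok"]
print(filter_comments(sample))
```

The same predicate could in principle be used at submission time to reject such comments before they are stored.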


3.2 Annotation Process 

One annotator who is a fluent English speaker and familiar
with the course context annotated the dataset. For quality
control, 100 reviews were randomly sampled from the
dataset and labeled by a second annotator who is also a
fluent English speaker and familiar with the course context.
The inter-annotator agreement between the two annotators
was measured by Cohen’s κ coefficient, which is generally
thought to be a more robust measure than a simple
percent-agreement calculation [12]. Cohen’s κ coefficient for
each label is shown in Table 3. The result suggests that the
two annotators had almost perfect agreement (>0.81) [12].
Sample annotated comments are provided in Table 2.


²https://github.com/expertiza/expertiza


Table 3: Inter-Annotator Agreement (Cohen’s κ)
Label | Suggestion | Problem | Tone | Average
Cohen’s Kappa | 0.92 | 0.84 | 0.87 | 0.88
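

For readers unfamiliar with the agreement statistic, the following minimal sketch computes Cohen's kappa for two annotators from the observed agreement and the agreement expected by chance. The function name and the toy labels are ours; the annotations from the study are not reproduced here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Degenerate case (both used one identical label); kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

Kappa would be computed once per label (suggestion, problem, tone) on the doubly annotated sample.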


We define each feature (label) in the context of automated 
peer-review evaluation as follows: 


Suggestion: A comment is said to contain a suggestion if it 
mentions how to correct a problem or make improvements. 


Problem: A comment is said to detect problems if it points 
out something that is going wrong in peers’ work. 


Positive Tone: A comment is said to use a positive tone if it 
has an overall positive semantic orientation. 


3.3 Statistics on the Dataset 


The minority class for each label includes more than 20% 
of samples, and thus the dataset is mildly imbalanced. It 
consists of 12,053 peer-review comments, and the average 
number of words for each peer-review comment is 29. We 
found that most students (over three-quarters) use a pos- 
itive tone in their peer-review comments. Around half of 
the review comments mention problems with their peers’ 
work, but only one-fifth of review comments give suggestions.
Characteristics of the dataset are shown in Table 4
below.


Table 4: Statistics on the Dataset 


Label     | Class | %samples | avg. #words | max #words
Sugg.     | 0     | 79.2%    | 22          | 922
          | 1     | 20.8%    | 58          | 1076
Prob.     | 0     | 56.7%    | 22          | 479
          | 1     | 43.3%    | 38          | 1076
Pos. Tone | 0     | 22.2%    | 28          | 1040
          | 1     | 77.8%    | 29          | 1076
4. METHODOLOGY

In this section, we first briefly introduce Transformer [26], 
BERT [6], and DistilBERT [21]. Then we describe BERT 
and DistilBERT based single-task and multi-task models. 


4.1 Transformer 

In 2017, Vaswani et al. published a groundbreaking paper, 
“Attention is all you need,” and proposed an architecture 
called Transformer, which significantly improved the perfor- 
mance of sequence-to-sequence tasks (e.g., machine trans- 
lation) [26]. The Transformer is entirely built upon self- 
attention mechanisms without using any recurrent or con- 
volutional layers. As shown in Figure 2, the Transformer 
consists of two parts: the left part is an encoder, and the 
right part is a decoder. The encoder block takes a batch of 
sentences represented as sequences of word IDs. Then the 
sequences pass through an embedding layer, and the posi- 
tional embedding adds positional information of each word. 


Figure 2: Architecture of the Transformer [26]


We briefly introduce the encoder block here, since BERT
reuses it. Each encoder consists of two layers: a multi-head
attention layer and a feed-forward layer. The multi-head 
attention layer uses the self-attention mechanism, which en- 
codes each word’s relationship with every other word in the 
same sequence, paying more attention to the most relevant 
ones. For example, the output of this layer for the word 
“like” in the sentence, “we like the Educational Data Mining 
conference 2021!” will depend on all the words in the sen- 
tence. However, it will probably pay more attention to the 
word “we” than to the words “data” or “mining.” 
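

The idea of weighting every other word in the sequence can be made concrete with a simplified sketch of scaled dot-product self-attention. It uses a single head and omits the learned query, key, and value projections of the real Transformer; the toy two-dimensional token vectors are invented for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention without learned projections.

    vectors: list of equal-length token vectors.
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)
        # Output is the attention-weighted average of all token vectors.
        out = [sum(w[j] * vectors[j][i] for j in range(len(vectors))) for i in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings
outputs, weights = self_attention(tokens)
print(weights[0])
```

Each row of the weight matrix sums to one, so every output vector depends on all tokens while emphasizing the most similar ones.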


4.2 BERT 


BERT is a state-of-the-art pre-trained language representa- 
tion model proposed by Devlin et al. [6]. It has advanced 
the state-of-the-art results in many NLP tasks and signif- 
icantly reduced the need for labeled data by pre-training 
on unlabeled data over different pre-training tasks. The
BERT-base model used in this study consists of 12 encoder
blocks of the Transformer model. The input representation is constructed by
summing the corresponding token and positional embed- 
dings. The length of the output sequence is the same as 
the input length, and each input token has a corresponding 
representation in the output. The output of the first token 
‘[CLS]’ (a special token added to the sequence) is utilized
as the aggregate representation of the input sequence for 
classification tasks [6]. 


The BERT framework consists of two steps: pre-training 
and fine-tuning. During pre-training, the model is trained 
on unlabeled data, BooksCorpus (800M words) and En- 
glish Wikipedia (2,500M words), over two pre-training tasks, 
Masked language model (MLM) and Next sentence predic- 
tion (NSP). For fine-tuning, the BERT model is first ini- 
tialized with the pre-trained parameters, and then all of 
the parameters are fine-tuned using labeled data from the 
downstream tasks (e.g., text classification). For this study, 
we use the HuggingFace pre-trained BERT³ to initialize the
models and then fine-tune them with annotated peer-review
comments for automated peer-review evaluation tasks. 


4.3 DistilBERT


Although BERT has shown remarkable improvements across
various NLP tasks and can be easily fine-tuned for downstream
tasks, one main drawback of BERT is that it is very
compute-intensive (it has a large number of parameters,
about 110M). Therefore, researchers have applied different
methods for compressing BERT, including pruning, quantization,
and knowledge distillation [10]. One of the compressed BERT
models is called DistilBERT [21]. DistilBERT is compressed
from BERT by leveraging the knowledge-distillation technique
during the pre-training phase. The authors [21] demonstrated
that DistilBERT has 40% fewer parameters and is 60% faster
than the original BERT while retaining 97% of its
language-understanding capabilities. We investigate whether we
can reduce model size while retaining performance for our
task with DistilBERT.⁴


4.4 Input Preparation 

Text Preprocessing: First, URL links in peer-review comments
are removed. Then, we lowercase all comments and
leverage a spellchecker API⁵ to correct typos and misspellings.
Finally, two special tokens ([CLS], [SEP]) are added to each
review comment, as required for BERT. The [CLS] token
is added to the beginning of each review comment for
classification tasks. The [SEP] token is added at the end of
each review comment.
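

The preprocessing steps can be sketched as follows. The URL pattern and the whitespace clean-up are our assumptions, and the spell-correction step is left as a comment because it depends on a third-party package; this is an illustration rather than the exact pipeline used in the study.

```python
import re

# Assumption: URLs start with http://, https://, or www. and run to the next space.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def preprocess(comment: str) -> str:
    """Remove URLs, lowercase, and wrap the comment in BERT's special tokens."""
    text = URL_PATTERN.sub(" ", comment)   # 1. remove URL links
    text = text.lower()                    # 2. lowercase the comment
    # 3. spell correction would be applied here (omitted in this sketch)
    text = " ".join(text.split())          # tidy whitespace left by removed URLs
    return f"[CLS] {text} [SEP]"           # 4. add the special tokens


print(preprocess("See https://example.com/doc for Details!"))
```

In practice, library tokenizers usually insert [CLS] and [SEP] automatically, so the last step is shown explicitly only to mirror the description above.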


Subword Tokenization: The tokenizer used for BERT is a 
subword tokenizer called “WordPiece” [29]. Traditional word
tokenizers suffer from the out-of-vocabulary (OOV) word
problem, which a subword tokenizer alleviates. It splits a
text into subwords, which are then converted to token IDs.
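

To illustrate how subword splitting sidesteps OOV words, here is a minimal sketch of greedy longest-match-first splitting in the style used by BERT's WordPiece tokenizer, where "##" marks a piece that continues a word. The tiny vocabulary and its IDs are hypothetical; the real vocabulary contains roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no subword matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary mapping subwords to token IDs.
toy_vocab = {"[UNK]": 0, "review": 1, "##er": 2, "##s": 3, "un": 4, "##help": 5, "##ful": 6}

pieces = wordpiece_tokenize("reviewers", toy_vocab)
print(pieces, [toy_vocab[p] for p in pieces])
```

A word missing from the vocabulary as a whole, such as "reviewers" here, is still represented by known pieces instead of a single unknown token.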


Input Representation: The token IDs are padded or truncated
to a length of 100 for each sequence and then pass through a
trainable embedding layer to be converted to token embeddings.
The input representation for BERT is constructed by
summing the token embeddings and positional embeddings.
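

A minimal sketch of the padding and truncation step is given below. The padding ID of 0 and the accompanying attention mask are assumptions based on common practice and are not stated in the text; the example token IDs are placeholders.

```python
MAX_LEN = 100  # sequence length used in this study
PAD_ID = 0     # assumption: ID of the padding token


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (ids, attention_mask), each exactly max_len long."""
    ids = list(token_ids[:max_len])          # truncate long sequences
    mask = [1] * len(ids)                    # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


ids, mask = pad_or_truncate([101, 7, 8, 102], max_len=6)
print(ids, mask)
```

Note that this naive truncation can cut off a trailing [SEP] token; library tokenizers typically truncate the content and then re-append the special tokens.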


³https://huggingface.co/bert-base-uncased
⁴https://huggingface.co/distilbert-base-uncased
⁵https://pypi.org/project/pyspellchecker/
Figure 3: BERT and DistilBERT based single-task and multi-task learning architectures


4.5 Single-Task and Multi-Task Models 

As mentioned in the BERT paper [6] and other studies [23], 
the pre-trained BERT model can be fine-tuned with just one 
additional output layer to create state-of-the-art models for 
a wide range of tasks, including text classification. There- 
fore, only one dense layer is added on top of the original 
BERT or DistilBERT model and used as a binary classifier
for the single-task learning models. Three dense layers are 
added to the multi-task learning models, one for each label. 
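

The structure of the multi-task model can be sketched in plain Python. The encoder below is a hypothetical stand-in that returns a small pseudo-random vector in place of the [CLS] output of BERT or DistilBERT, and each head is modeled as a single sigmoid unit, which is one possible form of a binary classifier and an assumption on our part.

```python
import math
import random

TASKS = ("suggestion", "problem", "positive_tone")


def toy_encoder(token_ids, hidden_size=8):
    """Hypothetical stand-in for the shared encoder's [CLS] representation."""
    rng = random.Random(sum(token_ids))
    return [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]


class MultiTaskClassifier:
    """One shared encoder followed by one dense output unit per task."""

    def __init__(self, encoder, hidden_size, tasks=TASKS, seed=0):
        rng = random.Random(seed)
        self.encoder = encoder  # shared by all tasks (hard-parameter sharing)
        self.heads = {}
        for task in tasks:
            weights = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
            self.heads[task] = (weights, 0.0)  # (weights, bias) of the dense layer

    def predict_proba(self, token_ids):
        pooled = self.encoder(token_ids)  # encoder runs once per comment
        probs = {}
        for task, (weights, bias) in self.heads.items():
            logit = sum(w * h for w, h in zip(weights, pooled)) + bias
            probs[task] = 1.0 / (1.0 + math.exp(-logit))
        return probs


model = MultiTaskClassifier(toy_encoder, hidden_size=8)
print(model.predict_proba([101, 2204, 2147, 102]))
```

A single-task model corresponds to the same class with one entry in `tasks`, so three single-task models would need three separate encoders.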


5. EXPERIMENTS AND RESULTS 


In this section, we first introduce training details and eval- 
uation metrics and then show experimental results. 


5.1 Training 

Train/Test split: We found experimentally that increasing the
training size does not help the classifier once the number of
training samples exceeds 5000. Therefore, 5000/2053/5000
data samples are used for training/validation/testing.


Loss Functions: For the BERT- and DistilBERT-based single-task
learning (STL) models, the cross-entropy loss is used.
For the BERT- and DistilBERT-based multi-task learning (MTL)
models, the cross-entropy loss is used for each task, and the
total loss is the sum of the cross-entropy losses of all
tasks.
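

As a small worked example of the summed loss, the sketch below computes the binary cross-entropy for each task and adds the three values for one comment. The predicted probabilities and the task names are invented for illustration.

```python
import math


def binary_cross_entropy(p, y, eps=1e-12):
    """Cross-entropy for one binary label y (0 or 1) and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def multitask_loss(probs, labels):
    """Total loss for one comment: sum of the per-task cross-entropy losses."""
    return sum(binary_cross_entropy(probs[task], labels[task]) for task in labels)


probs = {"suggestion": 0.9, "problem": 0.2, "positive_tone": 0.7}
labels = {"suggestion": 1, "problem": 0, "positive_tone": 1}
print(round(multitask_loss(probs, labels), 4))
```

Because the tasks share the encoder, the gradient of this sum updates the shared parameters with signal from all three labels at once.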


Cost-Sensitive method: As mentioned in Section 3.3, the 
dataset is mildly imbalanced (minority class > 20%). Thus, 
a cost-sensitive method is used in this study for alleviating 
the problem of class imbalance and improving performance, 
by weighting the cross-entropy loss function during training 
based on the frequency of each class in the training set. 
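

The exact weighting formula is not specified in the text. One common choice, assumed here, is inverse class frequency, n / (number of classes × class count), which gives the minority class a proportionally larger weight. The example uses a 79.2%/20.8% split like that of the suggestion label in Table 4.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n / (n_classes * count) for each class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def weighted_loss(loss, label, weights):
    """Scale a per-sample loss by the weight of its true class."""
    return weights[label] * loss


# Illustrative training labels with a 79.2% / 20.8% class split.
train_labels = [0] * 792 + [1] * 208
weights = class_weights(train_labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

With these weights, an error on a minority-class sample costs roughly four times as much as the same error on a majority-class sample.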


Hyperparameters: As we mentioned in Section 4.2 and Sec- 
tion 4.3, we use HuggingFace pre-trained BERT and Distil- 
BERT to initialize models. The hidden size for BERT and 
DistilBERT is 768. We then fine-tune the BERT and Dis- 
tilBERT based single-task learning and multi-task learning 
models with a batch size of 32, a max sequence length of 100,
a learning rate of 2e-5, 3e-5, or 5e-5, 2 or 3 epochs, a dropout
rate of 0.1, and the Adam optimizer with β1=0.9 and β2=0.99.
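

The candidate settings can be enumerated as a small configuration grid. Whether every combination of learning rate and epoch count was actually tried is not stated, so the exhaustive grid below is an assumption; the dictionary keys are our own naming.

```python
from itertools import product

# Settings held fixed across runs.
FIXED = {
    "batch_size": 32,
    "max_seq_length": 100,
    "dropout": 0.1,
    "adam_beta1": 0.9,
    "adam_beta2": 0.99,
}
LEARNING_RATES = [2e-5, 3e-5, 5e-5]
EPOCHS = [2, 3]


def candidate_configs():
    """All combinations of the listed learning rates and epoch counts."""
    return [
        dict(FIXED, learning_rate=lr, epochs=ep)
        for lr, ep in product(LEARNING_RATES, EPOCHS)
    ]


for cfg in candidate_configs():
    print(cfg["learning_rate"], cfg["epochs"])
```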


5.2 Evaluation Metrics 

We use accuracy, macro-F1 score (averaged over the classes of
each label rather than over the labels), and AUC (Area Under the
ROC Curve) to evaluate models. Since the dataset is only
mildly imbalanced, accuracy can still be a useful metric.
Macro-F1 is used instead of the F1-score for the positive class,
since both the positive class and the negative class for each label
are important for our task. For this study, we mainly use
accuracy and macro-F1 to compare different models.
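

To make the distinction between macro-F1 and positive-class F1 concrete, the following sketch computes accuracy and macro-F1 for one binary label. The predictions are toy values chosen for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1-score treating `cls` as the class of interest."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1-scores of one label."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(accuracy(y_true, y_pred), round(macro_f1(y_true, y_pred), 3))
```

In this toy case the F1-score is 2/3 for the positive class and 0.8 for the negative class, so macro-F1 reflects performance on both.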


5.3 Results 


Table 5 shows the performance of all models when trained
with different numbers of training samples (1K, 3K,
and 5K). The first column indicates the models (GloVe,
BERT, DistilBERT) and training settings (single-task learn- 
ing (STL), multi-task learning (MTL)). 


RQ1 Does BERT outperform previous methods? 

We first implemented a baseline single-task learning model 
by leveraging pre-trained GloVe (Global Vectors for Word 
Representation)⁶ [18] word embeddings. We added a batch-normalization
layer on top of GloVe, and the aggregate
representation of the input sequence for classification was
obtained by average-pooling the output of the batch-normalization
layer. A dense layer was added on top for
performing classification.


We compared GloVe and BERT for every single task. As 
shown in Table 5, the results clearly showed that a BERT- 
based STL model yields substantial improvements over the 
previous GloVe-based method. The STL-BERT model trained 
with 1000 data samples outperformed the STL-GloVe model 
trained with 5000 data samples on all tasks. This suggests 
that the need for labeled data could be significantly reduced 
by leveraging the pre-trained language model BERT.


RQ2 How does multi-task learning perform? 

By comparing MTL-BERT with STL-BERT and MTL-Distil- 
BERT with STL-DistilBERT when trained with a different 
number of training samples, we found that jointly learning 


⁶https://nlp.stanford.edu/projects/glove/
Table 5: Performance evaluation (average performance of 5 independent runs)


                         |      Suggestion       |        Problem        |       Pos. Tone
                         | Acc.   Macro-F1  AUC  | Acc.   Macro-F1  AUC  | Acc.   Macro-F1  AUC
Training with 1000 labeled data samples
1 STL-GloVe (Baseline)   | 82.0%  .744      .865 | 80.2%  .790      .879 | 76.0%  .700      .823
2 STL-BERT               | 90.0%  .868      .975 | 89.2%  .892      .955 | 87.0%  .828      .940
3 MTL-BERT               | 94.0%  .904      .974 | 89.0%  .890      .955 | 89.4%  .846      .941
4 STL-DistilBERT         | 92.4%  .890      .970 | 88.0%  .880      .950 | 86.2%  .822      .933
5 MTL-DistilBERT         | 93.8%  .910      .971 | 89.0%  .886      .951 | 88.6%  .824      .939
Training with 3000 labeled data samples
1 STL-GloVe (Baseline)   | 88.4%  .836      .929 | 83.0%  .830      .898 | 82.4%  .770      .872
2 STL-BERT               | 93.8%  .910      .980 | 90.6%  .904      .964 | 89.6%  .858      .948
3 MTL-BERT               | 94.6%  .916      .981 | 91.0%  .906      .964 | 90.0%  .854      .947
4 STL-DistilBERT         | 94.0%  .910      .979 | 89.8%  .900      .962 | 89.0%  .850      .942
5 MTL-DistilBERT         | 94.2%  .916      .978 | 89.6%  .892      .960 | 90.2%  .850      .945
Training with 5000 labeled data samples
1 STL-GloVe (Baseline)   | 89.9%  .852      .947 | 84.2%  .832      .908 | 85.0%  .794      .883
2 STL-BERT               | 94.4%  .916      .980 | 91.2%  .912      .968 | 89.4%  .852      .950
3 MTL-BERT               | 94.8%  .922      .982 | 91.0%  .908      .966 | 90.8%  .854      .951
4 STL-DistilBERT         | 94.2%  .912      .978 | 90.4%  .902      .964 | 89.8%  .860      .944
5 MTL-DistilBERT         | 94.2%  .914      .980 | 90.4%  .902      .964 | 90.6%  .852      .951


Table 6: The # of parameters for each setting 


Setting            | # of parameters
STL-BERT * 3       | 328M
STL-DistilBERT * 3 | 199M
MTL-BERT           | 109M
MTL-DistilBERT     | 66M


related tasks improves the performance of the suggestion- 
detection task and the positive-tone detection task, espe- 
cially when we have limited training samples (i.e., when 
training with 1K and 3K data samples). This suggests that 
MTL can increase data efficiency. However, for the problem- 
detection task, there is no significant difference between the 
performance of the STL and MTL settings. 


Additionally, MTL can considerably reduce the model size. 
As shown in Table 6, three BERT-based STL models would 
have more than 328M parameters, and this number would 
be 199M for the DistilBERT-based models. However, if we
employ the MTL models for evaluating peer-review com- 
ments, the number of parameters would be reduced to 109M 
and 66M, respectively. This demonstrates that using MTL 
to evaluate reviews can save considerable memory resources 
and reduce the response time of peer-review platforms. 
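

The parameter counts in Table 6 can be checked with simple arithmetic. The encoder sizes below are the commonly reported counts for the bert-base-uncased and distilbert-base-uncased checkpoints and are treated here as assumptions, as is the single-output dense head of 768 weights plus one bias.

```python
HIDDEN = 768   # hidden size of BERT and DistilBERT
N_TASKS = 3    # suggestion, problem, positive tone

# Assumption: commonly reported encoder parameter counts.
ENCODER_PARAMS = {"BERT": 109_482_240, "DistilBERT": 66_362_880}


def head_params(hidden=HIDDEN, outputs=1):
    """Parameters of one dense output layer: weights plus biases."""
    return hidden * outputs + outputs


def stl_total(encoder):
    """Three single-task models, each with its own encoder and head."""
    return N_TASKS * (ENCODER_PARAMS[encoder] + head_params())


def mtl_total(encoder):
    """One shared encoder with three task-specific heads."""
    return ENCODER_PARAMS[encoder] + N_TASKS * head_params()


for name in ENCODER_PARAMS:
    print(name, round(stl_total(name) / 1e6), round(mtl_total(name) / 1e6))
```

Under these assumptions the heads add fewer than 3,000 parameters in total, so the multi-task setting needs roughly one third of the parameters of three single-task models.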


RQ3 How does DistilBERT perform?

By comparing DistilBERT and BERT on both STL and 
MTL settings, we found that BERT-based models slightly 
outperformed DistilBERT-based models. This result implies
a trade-off between performance and model size when
selecting the model to be deployed on peer-review platforms.
If we focus on high accuracy instead of memory resource
usage and response time of the platforms, the MTL-BERT
model is the better choice. Otherwise, the MTL-DistilBERT
model should be deployed.
6. CONCLUSIONS


In this study, we implemented single-task and multi-task 
models for evaluating peer-review comments based on the 
state-of-the-art language representation models BERT and 
DistilBERT. Overall, the results showed that BERT-based
STL models yield significant improvements over the previous 
GloVe-based method on tasks of detecting a single feature. 
Jointly learning different tasks simultaneously further im- 
proves performance and saves considerable memory usage 
and response time for peer-review platforms. The MTL- 
BERT model should be deployed on peer-review platforms, 
if our focus is on high accuracy instead of memory resource 
usage and response time of the platforms. Otherwise, the 
MTL-DistilBERT model is preferred. 


There are three limitations to this study. Firstly, we em- 
ployed three features of high-quality peer reviews to eval- 
uate a peer-review comment. However, it is still unclear 
how MTL will perform if we learn more tasks simultane- 
ously. Secondly, we mainly focused on a hard-parameter 
sharing approach for constructing MTL models. However, 
some studies have found that the soft-parameter sharing ap- 
proach might be a more effective method for constructing 
multi-task learning models. Thirdly, the performance of the 
model has not been evaluated in actual classes. We intend to 
deploy the model on the peer-review platform and evaluate 
the model extrinsically in real-world circumstances. 


These preliminary results serve as a basis for our ongoing 
work, in which we are building a more complex all-in-one 
model for comprehensively and automatically evaluating the 
quality of peer-review comments to improve peer assessment.
In the future, we will attempt to evaluate peer reviews based 
on more predetermined features and use fine-grained labels 
(e.g., instead of evaluating whether a peer-review comment 
contains suggestions, we will evaluate how many suggestions 
are contained in a review comment). 